In this notebook, we'll take the Natural Language Toolkit (NLTK) for a spin, and apply it to James Joyce's Ulysses to see how Ulysses is assembled - in particular, Chapter 8, which Joyce called "Lestrygonians." We'll start from the lowest level - the individual letters - and work our way upwards, to n-grams, then words, then phrases, sentences, and paragraphs. To do this, we'll be using NLTK's built-in functions to tokenize, parse, clean, and process text.
We'll start by importing some libraries.
In [135]:
# In case we want to plot something:
%matplotlib inline
from __future__ import division
import nltk, re
# The io module makes unicode easier to deal with
import io
We'll use NLTK's built-in word tokenizer to tokenize the text. There are many tokenizers available, both at the word and sentence level, but the word_tokenize() function is the no-hassle option.
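There are other tokenizers worth knowing about, even though we won't use them here. The sketch below compares a few of NLTK's tokenizers on a made-up sample sentence (the sentence is only an illustration, not from the text):
# Comparing a few tokenizers on a throwaway sample sentence:
from nltk.tokenize import WordPunctTokenizer, RegexpTokenizer
sample = u"Mr Bloom walked on, thinking about lunch."
print nltk.word_tokenize(sample)               # the no-hassle word tokenizer
print WordPunctTokenizer().tokenize(sample)    # splits on all punctuation boundaries
print RegexpTokenizer(r'\w+').tokenize(sample) # keeps only alphanumeric runs, drops punctuation
print nltk.sent_tokenize(sample)               # sentence-level tokenization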
In [136]:
######################################
# Words, words, words.
# Start by splitting the text into word tokens.
# Create a word tokenizer object.
# This uses io.open() because that deals with unicode more gracefully.
tokens = nltk.word_tokenize(io.open('txt/08lestrygonians.txt','r').read())
The variable tokens is now a list containing all the word-level tokens that resulted from word_tokenize():
In [137]:
print type(tokens)
print len(tokens)
In [138]:
print tokens[:21]
Now that we have all the words from this chapter in a list, we can create a new object of class Text. This is a wrapper around a sequence of tokens, and is designed to explore the text by counting, providing a concordance, etc. We'll create a Text object from the list of tokens that resulted from word_tokenize(). The Text object also has some useful functions like findall(), to search for particular words or phrases using regular expressions.
In [139]:
def p():
    print "-"*20
# Start by creating an NLTK text object
text = nltk.Text(tokens)
p()
text.findall(r'<with> <.*> <and> <.*>')
p()
text.findall(r'<as> <.*> <as> <.*> <.*>')
p()
text.findall(r'<no> <\w+> <\w+>')
p()
text.findall(r'<\w+> <with> <the> <\w+>')
p()
text.findall(r'<see> <\w+> <\w+>')
In [140]:
p()
text.concordance('eye',width=65)
p()
text.concordance('Molly',width=65)
p()
text.concordance('eat',width=65)
Joyce's Ulysses is filled with colors. If we have a list of colors, we can pass them to the concordance one at a time, like so:
In [141]:
colors = ['blue','purple','red','green','white','yellow']#'indigo','violet']
for c in colors:
    p()
    text.concordance(c,width=65)
Chapter 8 of Ulysses, Lestrygonians, is named for the episode in Homer's Odyssey in which Odysseus and his crew encounter the island of the cannibal Lestrygonians. The language in the chapter very much reflects that. Here we use the count method of the Text class to count some meaty, sensory, organic words.
In [142]:
# Count a few words
def count_it(word):
    print '"'+word+'" : ' + str(text.count(word))

words = ['eyes','eye','mouth','God','food','eat','knife','blood','son','teeth','skin','meat','flies','guts']
for w in words:
    count_it(w)
Let's return to the results of word_tokenize() again, which were stored in the tokens variable. This contained about 15,000 words. We can use this as an all-inclusive wordlist, and filter out words based on certain criteria to get new wordlists. We can also use built-in methods for strings to process words. Here are two examples: one that converts all tokens to lowercase using a built-in string method, and one that extracts words with the suffix "-ed" using a regular expression.
In [143]:
# The tokens variable is a list containing
# a full wordlist, plus punctuation.
#
# We can make word lists by filtering out words based on criteria.
# Can use regular expressions or built-in string methods.
# For example, a list of all lowercase words using
# built-in string methods to filter:
#lowerlist = [w for w in tokens if w.islower()]
lowerlist = [w.lower() for w in tokens]
# and use this to find all "-ed" suffixes
# using a regular expression:
verbed = [w2 for w2 in lowerlist if re.search('ed$',w2) and len(w2)>4]
This also allows us to do things like compile lists of words that are either colors themselves, or that have color words in them. To find unique words, we use a set object, a built-in Python type, and add words that contain color words ('blue', 'green', and so on). The color red is excluded because it matches too many non-color words (like 'hundred').
In [144]:
colors = ['orange','yellow','green','blue','indigo','rose','violet']#,'red']
mecolors = set()
_ = [mecolors.add(w.lower()) for c in colors for w in tokens if re.search(c,w.lower()) ]
print mecolors
In [145]:
# Make a dictionary that contains total count of each character
charcount = {}
tot = 0
for word in lowerlist:
    for ch in word:
        if ch in charcount.keys():
            charcount[ch] += 1
            tot += 1
        elif (re.match('^[A-Za-z]{1,}$', ch) is not None):
            charcount[ch] = 1
            tot += 1
# Make a dictionary that contains frequency of each character
charfrequencies = {}
keys = charcount.keys()
keys.sort()
for k in keys:
    f = charcount[k]/(1.0*tot)
    charfrequencies[k] = f*100
print "%s : %s : %s"%('char.','occurr.','freq.')
for k in charcount.keys():
    print "%s : %04d times : %0.2f %%"%(k,charcount[k],charfrequencies[k])
Using character frequencies measured over a much broader sample of English, we can determine how closely the use of the alphabet in Lestrygonians matches ordinary English usage, and whether there's a there there. We'll start by importing a dictionary that maps each letter to its frequency in English.
The following bit of code makes sure that we're only dealing with the alphabet, and that the keys match between the Lestrygonians character frequency dictionary and the English language character frequency dictionary that we just imported.
In [146]:
from English import EnglishLanguageLetterFrequency
ufk = set([str(k) for k in charfrequencies.keys()])
efk = set([str(k) for k in EnglishLanguageLetterFrequency.keys()])
common_keys = ufk.intersection(efk)
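Note that English here is a small local module, not part of NLTK. If it isn't available, a minimal stand-in with approximate, commonly cited English letter frequencies (in percent) might look something like this; the exact numbers are only illustrative:
# Hypothetical stand-in for the local English module (approximate frequencies, in percent):
EnglishLanguageLetterFrequency = {
    'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7, 's': 6.3,
    'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8, 'u': 2.8, 'm': 2.4,
    'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0, 'p': 1.9, 'b': 1.5, 'v': 1.0,
    'k': 0.77, 'j': 0.15, 'x': 0.15, 'q': 0.10, 'z': 0.07,
}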
The next step is to compute how large a deviation from "normal" the frequency of each character is. We use the formula for percent difference: if the frequency of a given character in the Lestrygonians chapter is denoted $y_{L}$, and the frequency of that character in the English language is denoted $y_{E}$, then the percent difference is:
$$ \text{Pct Diff} = \dfrac{y_{L} - y_{E}}{y_{E}} \times 100\% $$
In [147]:
frequency_variation = {}
for k in common_keys:
    clf = charfrequencies[k]
    elf = EnglishLanguageLetterFrequency[k]
    pd = ((clf-elf)/elf)*100
    frequency_variation[k] = pd
In [148]:
for k in frequency_variation.keys():
    print "%s : %0.2f %%"%(k,frequency_variation[k])
We see, as we would expect, more variation among the less common letters: their small baseline frequencies make the percent difference much more sensitive.
However, there are some meaningful trends - if we pick out the letters that have a large positive percent difference, meaning they occur more in this chapter than they usually do in English, we find more harsh sounds: there are more b's, g's, j's and k's in this chapter than would be expected. These sounds are more guttural, cutting, blubbering letters, and fit with the character of Lestrygonians. Joyce's word selection in this chapter comes through even at the character frequency level.
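To pick those letters out programmatically rather than scanning the printout, we can sort the deviations computed above; a quick sketch:
# Sort letters by how much more often they occur here than in English at large:
overrepresented = sorted(frequency_variation.items(), key=lambda kv: kv[1], reverse=True)
for letter, pct in overrepresented[:8]:
    print "%s : %+0.2f %%"%(letter, pct)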
The next thing we can do is use regular expressions to analyze the bigrams that appear in the chapter. Regular expressions allow us to seek out pairs of letters matching a particular pattern. Combined with list comprehensions, we can do some very powerful analysis with just a little bit of code.
For example, this one-liner iterates over each word token, grabs bigrams matching a particular regular expression, and uses the result to initialize a frequency distribution object, which enables quick and easy access to useful statistical information about the bigrams:
In [149]:
vowel_bigrams = nltk.FreqDist(vs for word in lowerlist for vs in re.findall(r'[aeiou]{2}',word))
The regular expression '[aeiou]' will match any vowel, and the '{2}' syntax means match exactly two occurrences of the pattern, i.e., two vowels in a row.
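For example, applied to a throwaway string (not from the text), re.findall() returns every non-overlapping two-vowel run:
print re.findall(r'[aeiou]{2}', 'bloom eats good food')   # ['oo', 'ea', 'oo', 'oo']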
In [150]:
sorted(vowel_bigrams.items(), key=lambda tup: tup[1], reverse=True)
Out[150]:
By far the most common vowel-vowel bigram in Chapter 8 is "ou", with 554 occurrences, followed by "ea" (366 occurrences) and "oo" (304 occurrences) in a distant second and third place.
In [151]:
vc_bigrams = nltk.FreqDist(vs for word in lowerlist for vs in re.findall(r'[aeiou][^aeiou]',word))
sorted(vc_bigrams.items(), key=lambda tup: tup[1], reverse=True)[:20]
Out[151]:
In [152]:
cc_bigrams = nltk.FreqDist(vs for word in lowerlist for vs in re.findall(r'[^aeiou]{2}',word))
cc_bigrams.most_common()[:20]
Out[152]:
More guttural, slicing, cutting sounds populate the bigrams appearing in this chapter: "in", "er", "an", "on", and "at" dominate the vowel-consonant bigrams, while "th", "ng", "st", "nd", and "ll" dominate the consonant-consonant bigrams.
We can also look at a tabulated bigram table - this requires a different type of frequency distribution. We can create a conditional frequency distribution, which tabulates information like "what is the frequency of the second letter, conditional on the value of the first letter?" Create a conditional frequency distribution and use the tabulate() method:
In [153]:
cvs = [cv for w in lowerlist for cv in re.findall(r'[^aeiou][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()
In [154]:
cvs = [cv for w in lowerlist for cv in re.findall(r'[aeiou][b-df-hj-npr-tv-z]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()
In [155]:
vowel_ngrams = nltk.FreqDist(vs for word in lowerlist for vs in re.findall(r'[aeiou]{3,}',word))
vowel_ngrams.most_common()[:20]
Out[155]:
We can also look for consonant-vowel-vowel or vowel-vowel-consonant combinations:
In [156]:
cvv_ngrams = nltk.FreqDist(vs for word in lowerlist for vs in re.findall(r'[^aeiou][aeiou]{2}',word))
cvv_ngrams.most_common()[:20]
Out[156]:
In [157]:
vvc_ngrams = nltk.FreqDist(vs for word in lowerlist for vs in re.findall(r'[aeiou]{2}[^aeiou]',word))
vvc_ngrams.most_common()[:20]
Out[157]:
In [158]:
vcv_ngrams = nltk.FreqDist(vs for word in lowerlist for vs in re.findall(r'[aeiou][^aeiou][aeiou]',word))
vcv_ngrams.most_common()[:20]
Out[158]:
Some interesting results - "you", "out", and "ere" are the most common combinations of two vowels and a consonant, followed by "loo", "ear", "ose", and "ome". Plenty of o's in those syllables.
It's interesting to pick up on some of the patterns in bigrams and n-grams that occur in Lestrygonians, but how can we connect this better to the text itself? This is still a bit too abstract.
NLTK provides an Index object, which allows you to create a custom index with a list of words. Much like a book index collects a variety of subjects under letter headings and subject sub-headings, an Index object allows you to group information, but more powerfully: based on custom criteria. It behaves much like a Python dictionary.
In this case, we'll cluster words based on the n-grams that they contain, so that we can look up all the words that contain a particular n-gram.
In [159]:
cc_ngram = [(cc,w) for w in lowerlist for cc in re.findall('[^aeiou]{2}',w)]
cc_index = nltk.Index(cc_ngram)
cv_ngram = [(cv,w) for w in lowerlist for cv in re.findall('[^aeiou][aeiou]',w)]
cv_index = nltk.Index(cv_ngram)
print "The following words were found using the bigram index."
print "The words are listed in order of appearance in the text."
p()
print "Bigram: pp"
print cc_index['pp']
p()
print "Bigram: bb"
print cc_index['bb']
p()
print "Bigram: pu"
print cv_index['pu']
p()
print "Bigram: ki"
print cv_index['ki']
If we revisit the most common trigrams, we can see which words caused those trigrams to appear so often. Take the "you" trigram, which was the most common at 140 occurrences:
In [160]:
cvv_ngram = [(cv,w) for w in lowerlist for cv in re.findall(r'[^aeiou][aeiou][aeiou]',w)]
cvv_index = nltk.Index(cvv_ngram)
print "Number of words containing trigram 'you': " + str( len(cvv_index['you']) )
print "Number of unique words containing trigram 'you': " + str( len(set(cvv_index['you'])) )
print "They are:"
for t in set(cvv_index['you']):
    print t
It would be more convenient to create an Index that covers all trigrams, instead of just particular consonant/vowel patterns. We only need to modify the regular expression:
In [161]:
aaa_ngram = [(cv,w) for w in lowerlist for cv in re.findall(r'[A-Za-z]{3}',w)]
aaa_index = nltk.Index(aaa_ngram)
def print_trigram(trigram):
    print "Number of words containing trigram '"+trigram+"': " + str( len(aaa_index[trigram]) )
    print "Number of unique words containing trigram '" + trigram + "': " + str( len(set(aaa_index[trigram])) )
    print "They are:"
    trigram_fd = nltk.FreqDist(aaa_index[trigram])
    for mc in trigram_fd.most_common():
        print "%32s : %4d"%(mc[0],mc[1])
p()
print_trigram('one')
p()
print_trigram('can')
In [162]:
import numpy as np
sizes = [len(word) for word in lowerlist]
mean_size = np.mean(sizes)
var_size = np.std(sizes)
print "Mean word size: " + str(mean_size)
print "Std. dev. in word size: " + str(var_size)
To compose Lestrygonians, Joyce's word choice tended toward words containing harsher consonants like b's, g's, j's, and k's, and toward shorter words (roughly 2-6 letters), in keeping with the peristaltic nature of the chapter's language: masticating words, passing them down the mental esophagus one bit at a time.
What about the percentage of vowels and consonants occurring in the text?
This should be possible with a one-liner: a nested loop over all lowercase word tokens, and over all letters in each token, counting the matches of [aeiou] for vowels and of the non-vowel letters for consonants (note that the pattern [^aeiou] would also count punctuation, so we need to be careful), and that will give us our letter count.
In [163]:
# Count consonants and vowels
consonants = [re.findall('[b-df-hj-np-tv-z]',w) for w in lowerlist]
n_consonants = sum([len(e) for e in consonants])
vowels = [re.findall('[aeiou]',w) for w in lowerlist]
n_vowels = sum([len(e) for e in vowels])
print "Lestrygonians contains %d consonants and %d vowels, ratio of vowels:consonants is %0.2f"%(n_consonants,n_vowels,n_vowels/n_consonants)
We also see that interesting phenomenon whereby text remains partly or mostly comprehensible even with all of the vowels removed:
In [164]:
''.join([''.join(t)+" " for t in consonants[:145]])
#for t in consonants[:21]:
# print ''.join(t)
Out[164]:
Let's revisit that text object from earlier. Recall we created it using the code:
text = nltk.Text(tokens)
The Text object has made a determination of what words are similar, based on their appearance near one another in the text. We can obtain this list using the similar() method. The results are interesting to consider:
In [165]:
text.similar('woman')
In [166]:
text.similar('man')
Notice the woman-night and man-day association, and the woman-bull and man-nannygoat association. Some curious things here...
Because of the way the Text class is implemented, these are printed directly to output, instead of being returned as a list or something convenient. See the NLTK source for details. But we can fix this by extending the Text class to make our own class, called the SaneText class, to behave more sanely:
In [167]:
##################################################
# This block of code is extending an NLTK object
# to modify its behavior.
#
# This creates a SaneText object, which modifies the similar() method
# to behave in a sane way (return a list of words, instead of printing it
# to the console).
#
# Source: NLTK source code.
# Modified lines are indicated.
# http://www.nltk.org/_modules/nltk/text.html#ContextIndex.word_similarity_dict
from nltk.compat import Counter
from nltk.util import tokenwrap
class SaneText(nltk.Text):
    def similar(self, word, num=20):
        """
        This is copied and pasted from the NLTK source code,
        but with print statements replaced with return statements.
        """
        if '_word_context_index' not in self.__dict__:
            #print('Building word-context index...')
            self._word_context_index = nltk.ContextIndex(self.tokens,
                                            filter=lambda x:x.isalpha(),
                                            key=lambda s:s.lower())

        # words = self._word_context_index.similar_words(word, num)

        word = word.lower()
        wci = self._word_context_index._word_to_contexts
        if word in wci.conditions():
            contexts = set(wci[word])
            fd = Counter(w for w in wci.conditions() for c in wci[w]
                         if c in contexts and not w == word)
            words = [w for w, _ in fd.most_common(num)]
            #######
            # begin changed lines
            return tokenwrap(words)
        else:
            return u''
        # end changed lines
        ######
#
# Done defining custom class
################################################
In [168]:
sanetext = SaneText(tokens)
def similar_words(w):
    print "%32s : %s"%( w, sanetext.similar(w) )
similar_words('you')
similar_words('bread')
similar_words('beer')
similar_words('food')
similar_words('blood')
similar_words('meat')
similar_words('grave')
similar_words('right')
similar_words('priest')
similar_words('devil')
similar_words('coat')
similar_words('dress')
similar_words('life')
similar_words('babies')
similar_words('lamb')
similar_words('king')
similar_words('cut')
similar_words('fork')
In [169]:
similar_words('he')
similar_words('she')
similar_words('what')
similar_words('God')
Some curious and unexpected connections here - particularly "God" and "publichouse."
Let's tag each sentence of the text with its part of speech (POS), using NLTK's built-in method (trained on a large corpus of data).
NOTE: It's important to be very skeptical of the part of speech tagger's results. The following does not focus on whether the parts of speech being tagged are correct, as that leads to other topics involving multiple words. Here, we show how to analyze the results of a part of speech tagger, not how to train one to be more accurate.
We can tag the parts of speech with the nltk.pos_tag() method. This is a static method that results in a list of tuples, where each tuple is a word and its part-of-speech tag. The sentence
I am Jane.
would be tagged:
I (PRON) am (VERB) Jane (NOUN) . (.)
and would be stored as the list of tuples:
[('I', 'PRON'),
 ('am', 'VERB'),
 ('Jane', 'NOUN'),
 ('.', '.')]
In [170]:
print nltk.pos_tag(['I','am','Jane','.'],tagset='universal')
We can either pass it a list (like the result of the nltk.word_tokenize() method, i.e., our tokens variable above):
In [171]:
p()
print type(tokens[:21])
p()
print tokens[:21]
p()
print nltk.pos_tag(tokens[:21],tagset='universal')
or we can pass it an NLTK Text object, like our text object above:
In [172]:
p()
t = nltk.Text(tokens[:21])
print type(t)
p()
print t
p()
print nltk.pos_tag(t,tagset='universal')
If we tag the part of speech of the entire text, we can utilize list comprehensions, built-in string methods, and regular expressions to pick out some interesting information about parts of speech in Chapter 8. For example, we can search for verbs and include patterns to ensure a particular tense, search for nouns ending in "s" only, or extract parts of speech and analyze frequency distributions.
We'll start by tagging the parts of speech of our entire text. Recall that above, our list of all tokens was stored in the variable tokens:
tokens = nltk.word_tokenize(io.open('txt/08lestrygonians.txt','r').read())
We can feed this to the pos_tag() method to get the fully-tagged text of Lestrygonians.
In [173]:
words_tags = nltk.pos_tag(tokens,tagset='universal')
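As an example of the kinds of filters mentioned above (these particular one-liners are just sketches using crude suffix tests rather than real morphology, and aren't used in the analysis below), we could pull out past-tense-looking verbs or plural-looking nouns:
# Crude suffix-based filters over the tagged words:
ed_verbs = sorted(set(w.lower() for (w, t) in words_tags if t == 'VERB' and re.search('ed$', w)))
s_nouns  = sorted(set(w.lower() for (w, t) in words_tags if t == 'NOUN' and re.search('s$', w)))
print ed_verbs[:10]
print s_nouns[:10]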
Now use a list comprehension to extract the tags from the word/tag pairs (the variable words_tags) and pass the tags to a frequency distribution object.
In [174]:
tag_fd = nltk.FreqDist(tag for (word, tag) in words_tags)
print tag_fd.most_common()[:15]
summ = 0
for mc in tag_fd.most_common():
    summ += mc[1]
for mc in tag_fd.most_common():
    print "%s : %0.2f"%( mc[0], (mc[1]/summ)*100 )
An interesting observation: between the nouns, prepositions, conjunctions, and periods, you've got 45% of the chapter covered.
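We can check that figure directly by summing the relevant tag counts from the frequency distribution above (in the universal tagset, prepositions appear as 'ADP' and conjunctions as 'CONJ'):
# Share of the chapter covered by nouns, adpositions, conjunctions, and periods:
covered = sum(tag_fd[t] for t in ['NOUN', 'ADP', 'CONJ', '.'])
print "Coverage: %0.1f %%"%(100.0*covered/summ)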
There is an implicit bias, in simple taggers, to tag unknown words as nouns, so that may be the cause for all the nouns. But Lestrygonians is a more earthy, organic, and sensory chapter, so it would make sense that Joyce focuses more on nouns, on things and surroundings, and that those would fill up the chapter.
We'll see later on that we can explore those parts of speech tags to see whether they were correct and where they went wrong, when we look at multi-word phrases.
By utilizing the built-in FreqDist object, we can explore the most common occurrences of various parts of speech. For example, if we create a FreqDist with a tagged version of Lestrygonians, we can filter the most common words by their part of speech:
In [175]:
# Create a frequency distribution from POS tag data
words_tags_fd = nltk.FreqDist(words_tags)
In [176]:
most_common_verbs = [wt[0] for (wt, _) in words_tags_fd.most_common() if wt[1] == 'VERB']
most_common_verbs[:30]
Out[176]:
In [177]:
most_common_nouns = [wt[0] for (wt, _) in words_tags_fd.most_common() if wt[1] == 'NOUN']
most_common_nouns[:20]
Out[177]:
Using a conditional frequency distribution, we can check on the probability that a particular part of speech will be a particular word - that is, finding the most frequent parts of speech. We do this by tabulating a conditional frequency distribution for two independent variables: the words, and the parts of speech. The conditional frequency distribution can then be queried for all of the words corresponding to a particular part of speech, and their frequencies.
In [178]:
cfd2 = nltk.ConditionalFreqDist((tag, word.lower()) for (word, tag) in words_tags)
Now we can, for example, print the most common numerical words that appear in the chapter, and we can see that "one" is the most common, followed by "two", "three", and "five".
In [179]:
print cfd2['NUM'].most_common()
In [180]:
def pos_combo(pos1,pos2,pos3):
    for i in range(len(words_tags)-2):
        if( words_tags[i][1]==pos1 ):
            if( words_tags[i+1][1]==pos2 ):
                if( words_tags[i+2][1]==pos3 ):
                    print ' '.join([words_tags[i][0],words_tags[i+1][0],words_tags[i+2][0]])
pos_combo('ADJ','ADJ','NOUN')
In [181]:
pos_combo('ADV','ADV','VERB')
In [182]:
pos_combo('ADP','VERB','NOUN')
The phrase "with sopping sippets" above shows an example of a mislabeled word: "sopping" should be labeled as an adjective, but was mislabeled by a tagger that indiscriminately labels "-ing" words as verbs (getting, writing, holding). Likewise, "pudding" was accidentally tagged as a verb for the same reason. There's also an "-ed" word, "bloodhued", which was, for some inexplicable reason, tagged as a verb instead of an adjective.
In any case, we see that there is room for improvement, but it's an interesting way to explore the text nevertheless.
We already explored what parts of speech are the most common in this text. But suppose we're interested, now, in what combinations of parts of speech are the most common. To do this, we can construct a conditional frequency distribution based on the parts of speech.
If we're thinking about three-word phrases, then we want to tabulate how frequently particular combinations of three parts of speech show up. We can think of this as a probability distribution of three random variables taking on values from a set of nominal values ('NOUN', 'VERB', etc.). This is equivalent to a three-dimensional space chopped up into bins, with different frequencies occurring in each bin based on the words appearing in the text.
However, because NLTK's built-in conditional frequency object is only designed to handle two-dimensional data, we'll have to be a bit careful. We'll start by picking the first part of speech (thus eliminating one variable). Then we'll tabulate the conditional frequencies the combinations of parts of speech for the remaining two words.
In [183]:
def get_trigrams(tagged_words):
    for i in range(len(tagged_words)-2):
        yield (tagged_words[i],tagged_words[i+1],tagged_words[i+2])

trigram_freq = {}
for ((word1,tag1),(word2,tag2),(word3,tag3)) in get_trigrams(words_tags):
    if( tag1 in trigram_freq.keys() ):
        trigram_freq[tag1].append((tag2,tag3))
    else:
        trigram_freq[tag1] = [(tag2,tag3)]
adj_cf = nltk.ConditionalFreqDist(trigram_freq['ADJ'])
adj_cf.tabulate()
This is a dense, interesting table. This table shows the likelihood of particular parts of speech occurring after adjectives.
To utilize this table, we begin by selecting a part of speech (in this case, adjective) that is the basis for the table. Next, we pick the part of speech of the second word, and select it from the labels on the left side of the table (that is, the rows indicate the part of speech of the second word in our trigram). Finally, we pick the part of speech of the third word, and select it from the labels on the top of the table (the columns indicate the part of speech of the third word in our trigram).
This gives the total occurrences of this combination of parts of speech.
The first thing we notice is that nouns are by far the most common part of speech to occur after adjectives - precisely what we would expect. Verbs are far less common as the second word in our trigram - the adjective-verb combination is rather unusual. But verbs are much more common as the third part of speech in the trigram.
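To pull a single cell out of the table programmatically, rather than reading it off the printout, we can index the conditional frequency distribution directly:
# adj_cf['PRON'] is a FreqDist over third-word tags, given ADJ then PRON:
print adj_cf['PRON']['VERB']          # count of ADJ-PRON-VERB trigrams
print adj_cf['NOUN'].most_common(3)   # most common tags following an ADJ-NOUN pair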
Suppose we're interested in the combination adjective-pronoun-verb, which the table tells us occurs exactly 10 times. We can print out these combinations:
In [184]:
pos_combo('ADJ','PRON','VERB')
The adjective-pronoun-verb combination seems to happen when there is a change or transition in what the sentence is saying - the beginning of a new phrase.
On the other hand, if we look at the adjective-noun-adjective combination, which is similarly infrequent, it shows words that look like they were intended to cluster together:
In [185]:
pos_combo('ADJ','NOUN','ADJ')
In [186]:
print sanetext
print dir(sanetext)
To deal more deftly with this information, and actually make good use of it, we would need to get a little more sophisticated with how we're storing our independent variables.
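One possible approach (a sketch only; it isn't used in the rest of the notebook) is to key a single conditional frequency distribution on the first two tags as a tuple, so that any pair of leading parts of speech can be queried directly:
# Condition on the (first, second) tag pair; the samples are the third tag:
trigram_cfd = nltk.ConditionalFreqDist(
    ((t1, t2), t3)
    for ((w1, t1), (w2, t2), (w3, t3)) in get_trigrams(words_tags))
print trigram_cfd[('ADJ', 'PRON')].most_common(5)
print trigram_cfd[('PRON', 'VERB')].most_common(5)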
This information could be used to reveal what combinations of parts of speech are most likely to be complete, standalone phrases. For example, the adjective-pronoun-verb combination does not result in words that are intended to work together in a single phrase. If we were to analyze three-word phrases beginning with pronouns, for example, we would see that pronouns followed by verbs are very common:
In [187]:
pro_cf = nltk.ConditionalFreqDist(trigram_freq['PRON'])
pro_cf.tabulate()
The fact that the pronoun-verb combination occurs frequently, but the adjective-pronoun-verb combination does not, indicates that the adjective-pronoun-verb combination is unlikely to be a sensible phrase. By analyzing a few specific sentences and identifying issues with tags, as we did above, it's possible to improve the part of speech tagger. It's also possible to use a hierarchical part of speech tagger: one that starts by looking at three or four neighboring words and determining the part of speech based on their parts of speech, and, if that fails, backs off to two neighboring words, then one, down to the simplest case of looking at a single word by itself. Many of the part of speech tags that failed were fooled by the suffixes of words and ignored their context.
To improve the tagging of parts of speech, we can use n-gram tagging, which is the idea that when you're tagging a word with a part of speech, it can be helpful to look at neighboring words and their parts of speech. However, in order to do this, we'll need a set of training data, to train our part of speech tagger.
Start by importing a corpus. The brown corpus, described at the Brown Corpus wikipedia page, is approximately one million tagged English words in a range of different categories. This is far more complete than the treebank corpus, another tagged (but partially complete) corpus from the University of Pennsylvania (the Treebank project's homepage gives a 404, but see Treebank-3 from the Linguistic Data Consortium).
We'll use two data sets: one training data set, to train the part of speech tagger, using tagged sentences; and one test data set, to which we apply our part of speech tagger, and compare to a known result (tagged versions of the same sentences), enabling a quantitative measure of accuracy.
In [188]:
from nltk.corpus import brown
print dir(brown)
In [189]:
print brown.categories()
In [190]:
# Get tagged and untagged sentences of fiction
btagsent = brown.tagged_sents(categories='fiction')
bsent = brown.sents(categories='fiction')
Now split the data into two parts: the training data set and the test data set. The NLTK book suggests 90%/10%:
In [191]:
z = 0.90   # z is the fraction of sentences used for training
omz = 1 - z # one minus z: the fraction held out for testing
traincutoff = int(len(btagsent)*z)
traindata = btagsent[:traincutoff]
testdata = btagsent[traincutoff:]
In [192]:
# Import timeit so we can time how long it takes to train and test POS tagger
import timeit
In [193]:
start_time = timeit.default_timer()
# NLTK makes training and testing really easy
unigram_tagger = nltk.UnigramTagger(traindata)
perf1 = unigram_tagger.evaluate(testdata)
time1 = timeit.default_timer() - start_time
print perf1
print time1
Different taggers can be combined in a chain, each falling back to the next when it can't tag a word, like this:
In [194]:
start_time = timeit.default_timer()
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(traindata, backoff=t0)
t2 = nltk.BigramTagger(traindata, backoff=t1)
perf2 = t2.evaluate(testdata)
time2 = timeit.default_timer() - start_time
print perf2
print time2
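We could extend the chain one more level with a trigram tagger that backs off to the bigram tagger. This isn't part of the chain used below; it's just a sketch of how the pattern continues:
# Hypothetical extension of the backoff chain (not used in the cells that follow):
start_time = timeit.default_timer()
t3 = nltk.TrigramTagger(traindata, backoff=t2)
perf3 = t3.evaluate(testdata)
time3 = timeit.default_timer() - start_time
print perf3
print time3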
In [195]:
print "Fitting improvement: %d %%"%(100*abs(perf2-perf1)/perf1)
print "Timing penalty: %d %%"%(100*abs(time2-time1)/time1)
Hmmmm......
In any case - if we want to save this model, we can save it in a pickle file:
In [196]:
from pickle import dump
output = open('lestrygonians_parser.pkl', 'wb')
dump(t2, output, -1)
output.close()
## To load:
# from pickle import load
# input = open('lestrygonians_parser.pkl', 'rb')
# tagger = load(input)
# input.close()
In [197]:
better_tags = t2.tag(tokens)
print "%40s\t\t%30s"%("original","better")
for z in range(110):
    print "%35s (%s)\t%35s (%s)"%(words_tags[z][0],words_tags[z][1],better_tags[z][0],better_tags[z][1])
The analysis, from here, moves on to phrases and sentences, which will be covered in Part II.